Exploiting the Cost (In)sensitivity of Decision Tree Splitting Criteria

Authors

  • Chris Drummond
  • Robert C. Holte
Abstract

This paper investigates how the splitting criteria and pruning methods of decision tree learning algorithms are influenced by misclassification costs or changes to the class distribution. Splitting criteria that are relatively insensitive to costs (class distributions) are found to perform as well as, or better than, cost-sensitive splitting criteria in terms of expected misclassification cost. Consequently there are two opposite ways of dealing with imbalance. One is to combine a cost-insensitive splitting criterion with a cost-insensitive pruning method to produce a decision tree algorithm little affected by cost or prior class distribution. The other is to grow a cost-independent tree which is then pruned in a cost-sensitive manner.

Introduction

When applying machine learning to real-world classification problems, two complications that often arise are imbalanced classes (one class occurs much more often than the other (Kubat, Holte, & Matwin 1998; Ezawa, Singh, & Norton 1996; Fawcett & Provost 1996)) and asymmetric misclassification costs (the cost of misclassifying an example from one class is much larger than the cost of misclassifying an example from the other class (Domingos 1999; Pazzani et al. 1997)). Traditional learning algorithms, which aim to maximize accuracy, treat positive and negative examples as equally important and therefore do not always produce a satisfactory classifier under these conditions. Furthermore, in these circumstances accuracy is not an appropriate measure of classifier performance (Provost, Fawcett, & Kohavi 1998).

Class imbalance and asymmetric misclassification costs are related to one another. One way to counteract imbalance is to raise the cost of misclassifying the minority class. Conversely, one way to make an algorithm cost sensitive is to intentionally imbalance the training set.

In this paper we investigate how the splitting criteria of decision tree learning algorithms are influenced by changes to misclassification costs or class distribution. We show that splitting criteria in common use are relatively insensitive to costs and class distribution; costs and class distribution primarily affect pruning (Breiman et al. 1984, p. 94). One criterion, which we refer to as DKM (Kearns & Mansour 1996; Dietterich, Kearns, & Mansour 1996), is completely insensitive to costs and class distributions, yet in our experiments its performance equals or exceeds that of other splitting criteria.

This suggests two different ways of dealing with imbalance and costs. First, instead of artificially adjusting the balance by duplicating or discarding examples, a cost-insensitive splitting criterion can be combined with a cost-insensitive pruning method to produce a decision tree algorithm little affected by cost or prior class distribution. All the available data can be used to produce the tree, so no information is thrown away, and learning speed is not degraded by duplicate instances. Alternatively, one can grow a cost-independent tree which is then pruned in a cost-sensitive manner. The tree then need only be grown once, an advantage as growing trees is computationally more expensive than pruning them.
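To make the criteria under discussion concrete, the sketch below compares the two-class impurity functions behind three common splitting criteria: entropy (used by C4.5), the Gini index (used by CART), and DKM, whose impurity 2*sqrt(p(1-p)) is the criterion of Kearns & Mansour (1996). The class counts in the example split are invented purely for illustration.

```python
import math

# Two-class impurity functions, each taking p = fraction of positive examples.
def entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def gini(p):
    # Gini index for two classes: 1 - p^2 - (1-p)^2 = 2p(1-p)
    return 2 * p * (1 - p)

def dkm(p):
    # Kearns & Mansour (1996): 2 * sqrt(p(1-p))
    return 2 * math.sqrt(p * (1 - p))

def split_gain(impurity, parent, left, right):
    """Weighted impurity decrease for a binary split.
    parent/left/right are (n_pos, n_neg) counts."""
    def frac_pos(counts):
        return counts[0] / (counts[0] + counts[1])
    n = parent[0] + parent[1]
    w_left = (left[0] + left[1]) / n
    w_right = (right[0] + right[1]) / n
    return (impurity(frac_pos(parent))
            - w_left * impurity(frac_pos(left))
            - w_right * impurity(frac_pos(right)))

# Hypothetical split of 40 positives and 60 negatives, scored by each criterion.
parent, left, right = (40, 60), (30, 10), (10, 50)
for name, f in [("entropy", entropy), ("gini", gini), ("dkm", dkm)]:
    print(name, round(split_gain(f, parent, left, right), 4))
```

Rescaling the counts to simulate a change of priors (for example, duplicating every positive) shifts the gains under entropy and Gini differently than under DKM, which is the kind of (in)sensitivity the paper measures.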
Measuring Cost Sensitivity

We restrict ourselves to two-class problems in which the cost of a misclassification depends only on the class, not on the individual example. Following (Provost & Fawcett 1998) we use ROC methods to analyze and compare the performance of classifiers.

One point in an ROC diagram dominates another if it is above and to the left, i.e. has a higher true positive rate (TP) and a lower false positive rate (FP). If point A dominates point B, A will outperform B for all possible misclassification costs and class distributions. By "outperforms" we typically mean "has lower expected cost", but (Provost & Fawcett 1998) have shown that dominance in ROC space implies superior performance for a variety of commonly used performance measures. The slope of the line connecting two ROC points (FP_1, TP_1) and (FP_2, TP_2) is given by equation 1 (Provost, Fawcett, & Kohavi 1998; Provost & Fawcett 1997):

\frac{TP_1 - TP_2}{FP_1 - FP_2} = \frac{p(-)\,C(+|-)}{p(+)\,C(-|+)} \qquad (1)

where p(x) is the probability of a given example being in class x, and C(x|y) is the cost incurred if an example in class y is misclassified as being in class x. Equation 1 shows that, for the purpose of evaluating performance in two-class problems, class probabilities ("priors") and misclassification costs are interchangeable. Doubling p(+) has the same effect on performance as doubling the cost C(-|+) or halving the cost C(+|-). In the rest of the paper we will freely interchange the two, speaking sometimes of costs and other times of priors.

A classifier is a single point in ROC space. Point (0,0) represents classifying all examples as negative; (1,1) represents classifying all examples as positive. We call these the trivial classifiers. The slopes of the lines connecting a non-trivial classifier to (0,0) and to (1,1) define the range of cost ratios for which the classifier is potentially useful. For cost ratios outside this range, the classifier will be outperformed by a trivial classifier. When comparing two classifiers, it is therefore important not to use a cost ratio outside the operating range of either of them.

A classifier's operating range may be much narrower than one intuitively expects. Consider the solid lines in Figure 1. These connect (0,0) and (1,1) to a classifier which is approximately 70% correct on each class. The slopes, shown below the lines, are 0.45 and 2.2. If the cost ratio is outside this range, this classifier is outperformed by a trivial classifier. The operating range increases as one moves towards the ideal classifier, (0,1). Therefore, if classifier A dominates classifier B, A's operating range will be larger than B's.

[Figure 1: ROC plot (x-axis: False Positive Rate, y-axis: True Positive Rate) showing the solid lines that connect a classifier roughly 70% correct on each class to the trivial classifiers (0,0) and (1,1).]
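The slope and operating-range calculations above are simple enough to sketch directly. In the snippet below, the ROC point (FP, TP) = (0.31, 0.69) and the example costs are assumptions chosen only to reproduce the slopes quoted for Figure 1.

```python
def iso_performance_slope(p_pos, cost_fn, cost_fp):
    """Slope of an iso-performance line in ROC space, per equation 1:
    p(-)C(+|-) / (p(+)C(-|+)), where cost_fn = C(-|+) is the cost of a
    false negative and cost_fp = C(+|-) the cost of a false positive."""
    return ((1 - p_pos) * cost_fp) / (p_pos * cost_fn)

def operating_range(fp, tp):
    """Slopes of the lines joining (fp, tp) to the trivial classifiers.
    The classifier beats both trivial classifiers only for cost ratios
    whose iso-performance slope lies strictly inside this range."""
    upper = tp / fp               # slope of the line to (0, 0)
    lower = (1 - tp) / (1 - fp)   # slope of the line to (1, 1)
    return lower, upper

# A classifier roughly 70% correct on each class, as in Figure 1.
lo, hi = operating_range(fp=0.31, tp=0.69)
print(round(lo, 2), round(hi, 2))   # 0.45 and 2.23

# A skewed deployment: balanced priors, but false negatives cost 10x more.
slope = iso_performance_slope(p_pos=0.5, cost_fn=10, cost_fp=1)
print(lo < slope < hi)  # False: slope 0.1 falls outside the operating
                        # range, so a trivial classifier does better here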
